
add disaster-recovery component#10686

Draft
filariow wants to merge 4 commits into redhat-appstudio:main from filariow:add-dr


@filariow
Member

  • add disaster-recovery to the development overlay
  • add eventlistener and cronjob

This only affects the development overlay

Signed-off-by: Francesco Ilario <filario@redhat.com>

rh-pre-commit.version: 2.3.2
rh-pre-commit.check-secrets: ENABLED
use tekton's eventlistener and trigger plus a cronjob
to execute a pipeline every hour

cf. https://github.com/tektoncd/triggers/tree/main/examples/v1beta1/cron

Signed-off-by: Francesco Ilario <filario@redhat.com>
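
The cron-based trigger described above could look roughly like the following. This is a minimal sketch modeled on the linked Tekton Triggers cron example, not the manifests in this PR. The EventListener name `disaster-recovery`, the TriggerTemplate name `disaster-recovery-template`, the service account name, and the `curlimages/curl` image are assumptions for illustration. Only the CronJob name, namespace, and schedule come from the review thread.

```yaml
# Hypothetical EventListener; Tekton Triggers exposes it as the
# Service "el-<name>" on port 8080.
apiVersion: triggers.tekton.dev/v1beta1
kind: EventListener
metadata:
  name: disaster-recovery                 # assumed name
  namespace: konflux-disaster-recovery
spec:
  serviceAccountName: disaster-recovery-triggers   # assumed name
  triggers:
    - name: cron
      template:
        ref: disaster-recovery-template   # assumed TriggerTemplate that creates the PipelineRun
---
# CronJob that POSTs an empty JSON payload to the EventListener every hour.
apiVersion: batch/v1
kind: CronJob
metadata:
  name: run-disaster-recovery-pipelinerun
  namespace: konflux-disaster-recovery
spec:
  schedule: "0 * * * *" # every hour
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: trigger
              image: curlimages/curl      # assumed image
              args:
                - curl
                - -X
                - POST
                - --data
                - "{}"
                - http://el-disaster-recovery.konflux-disaster-recovery.svc.cluster.local:8080
```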
@openshift-ci

openshift-ci bot commented Feb 27, 2026

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@openshift-ci

openshift-ci bot commented Feb 27, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: filariow
Once this PR has been reviewed and has the lgtm label, please assign simonbaird for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@meyrevived
Contributor

Hey @filariow, so adding disaster recovery to the development overlay is for the e2e-tests. In the e2e-tests, the backup and recovery are all done programmatically through Ginkgo code. It just needs the infrastructure to be available (MinIO + OADP sitting there, ready), not an automatic DR action.
The backup action has its own ApplicationSet, with cluster label selectors here. This PR does things in a completely different model. Could you explain more about why this is how you proposed to do things?

What e2e-tests needs for the DR effort is just to have a development/ dir here with MinIO + OADP manifests and also to ensure the existing backup ApplicationSet routes dev clusters to that overlay.

@filariow
Member Author

filariow commented Mar 2, 2026

> Hey @filariow, so adding disaster recovery to the development overlay is for the e2e-tests. In the e2e-tests, the backup and recovery are all done programmatically through Ginkgo code. It just needs the infrastructure to be available (MinIO + OADP sitting there, ready), not an automatic DR action.

What this PR adds is not meant to be used in the development overlay; it's meant for staging. In development we just want to test that changes to its manifests are sound before they are promoted to staging. In the e2e-tests executed in the development overlay we can also suspend the cronjob. In staging it will be used to execute the test periodically.

> The backup action has its own ApplicationSet, with cluster label selectors here. This PR does things in a completely different model. Could you explain more about why this is how you proposed to do things?

That's because for the scope of this PR we want to target dev only, not all labeled clusters.

@github-actions
Contributor

github-actions bot commented Mar 2, 2026

Kustomize Render Diff

Comparing 565bdcd162320015cc

| Component | Environment | Changes |
| --- | --- | --- |
| components/disaster-recovery/development | development | +164 -0 |
| components/etcd-shield/production | production | build error |
| components/etcd-shield/production/kflux-fedora-01 | production | +1 -1 |
| components/etcd-shield/production/kflux-ocp-p01 | production | +1 -1 |
| components/etcd-shield/production/kflux-osp-p01 | production | +1 -1 |
| components/etcd-shield/production/kflux-prd-rh02 | production | +1 -1 |
| components/etcd-shield/production/kflux-prd-rh03 | production | +1 -1 |
| components/etcd-shield/production/kflux-rhel-p01 | production | +1 -1 |
| components/etcd-shield/production/stone-prd-rh01 | production | +1 -1 |
| components/etcd-shield/production/stone-prod-p01 | production | +1 -1 |
| components/etcd-shield/production/stone-prod-p02 | production | +1 -1 |
| components/monitoring/prometheus/production/kflux-fedora-01 | production | +614 -0 |
| components/disaster-recovery//empty-base | staging | build error |
| components/external-secrets-operator/staging | staging | +3 -3 |
| components/monitoring/prometheus/staging/base | staging | +1 -2 |
| components/monitoring/prometheus/staging/kflux-stg-es01 | staging | +1 -2 |
| components/monitoring/prometheus/staging/stone-stage-p01 | staging | +1 -2 |
| components/monitoring/prometheus/staging/stone-stg-rh01 | staging | +1 -2 |
| components/multi-platform-controller/staging | staging | +2 -9 |
| components/multi-platform-controller/staging-downstream | staging | +2 -9 |

Total: 20 components, +798 -38 lines

📋 Full diff available in the workflow summary and as a downloadable artifact.

@meyrevived
Contributor

@filariow

> In development we just want to test that changes to its manifests are sound before they are promoted to staging.

By what test? The e2e-tests suites might be disrupted by the cron job being triggered every hour, and it will definitely need to be suspended in the planned ITSes. Do you plan on something else?

@filariow
Member Author

filariow commented Mar 2, 2026

> @filariow
>
> > In development we just want to test that changes to its manifests are sound before they are promoted to staging.
>
> By what test? The e2e-tests suites might be disrupted by the cron job being triggered every hour, and it will definitely need to be suspended in the planned ITSes. Do you plan on something else?

Just by the fact that ArgoCD can install them successfully and proceed to run the e2e-tests. In the development overlay we can patch the cronjob so it does not execute (spec.suspend: true) or patch the pipeline to run a no-op.
This way the e2e tests running on every PR won't be impacted and, before we promote the changes to staging, we'll validate that our manifests are sound and that ArgoCD managed to apply them in the development environment.
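
As a rough illustration of the first option, a development overlay could suspend the CronJob with a Kustomize JSON 6902 patch. This is a sketch under assumptions: the `../base` path and the overlay layout are hypothetical, and only the CronJob name is taken from the review thread. The batch/v1 CronJob field is `spec.suspend`.

```yaml
# kustomization.yaml in a hypothetical development overlay
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - ../base                               # assumed location of the shared manifests
patches:
  - target:
      group: batch
      version: v1
      kind: CronJob
      name: run-disaster-recovery-pipelinerun
    patch: |-
      # Keep the CronJob installed but stop it from creating Jobs.
      - op: add
        path: /spec/suspend
        value: true
```

With a patch like this, ArgoCD still applies the full set of manifests, so the install path is exercised, while no scheduled run interferes with the e2e-tests.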

@eisraeli
Contributor

eisraeli commented Mar 2, 2026

@filariow
Given that we currently maintain the backup ArgoCD application within our disaster recovery scope, would it be worth considering a consolidation of these two components?

  name: run-disaster-recovery-pipelinerun
  namespace: konflux-disaster-recovery
spec:
  schedule: "0 * * * *" # every hour
Contributor

@filariow Don't we want to run it daily? Every hour is too frequent.

Member Author

Yeah, I agree it could be too frequent for the real use case. However, right now this targets the development overlay only, and it executes a dummy pipeline. Let's agree on the right schedule when we target staging.
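
For reference, if a daily schedule were chosen for staging, the change would be limited to the cron expression. The time of day below (midnight, in the timezone the cluster uses for CronJobs) is an assumption, not something agreed in this thread.

```yaml
spec:
  # Current value in this PR: "0 * * * *" (minute 0 of every hour)
  schedule: "0 0 * * *" # once a day at 00:00
```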
